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Abstract 



^ Over the last couple of years it has been realized that the vast computational power of graphics processing units (GPUs) 
I could be harvested for purposes other than the video game industry. This power, which at least nominally exceeds 
that of current CPUs by large factors, results from the relative simplicity of the GPU architectures as compared to 
CPUs, combined with a large number of parallel processing units on a single chip. To benefit from this setup for general 
pH computing purposes, the problems at hand need to be prepared in a way to profit from the inherent parallelism and 
^ hierarchical structure of memory accesses. In this contribution I discuss the performance potential for simulating spin 
models, such as the Ising model, on GPU as compared to conventional simulations on CPU. 
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1. Introduction 

Owing to a combination of an improved toolset of sim- 

ulational machinery and methods of data analysis and the 
exponential increase in available computer power observed 
over the past four decades, computer simulations such as 
the Monte Carlo method have at least drawn level with 
the more traditional perturbative approaches for studying 
a plethora of problems in statistical physics [1], ranging 
from critical phenomena [2] over the physics of disordered 
systems [3] to soft matter and biological problems [4] . This 
success notwithstanding, a range of notoriously hard prob- 
lems appear to create an insatiable appetite for more pow- 
erful computational devices to finally settle a number of 
long-standing questions. Among such problems are, for in- 
stance, the quest of understanding the nature of the spin 
glass phase [5] or the protein folding problem. To achieve 
results beyond the reach of the available standard compu- 
tational resources of the time, there has been a tradition of 
designing special purpose computers, e.g., for calculations 
in lattice field theory [6] or the simulation of spin models 
[7, 8]. 

Since the design and programming of such dedicated 
machines regularly requires a large effort in terms of mon- 
etary and human resources, recently scientists have started 
to adopt the use of graphics processing units for general 
purpose computational tasks in the hope of harvesting 
their nominally vast computational power, on par with 
some devices based on FPGAs, without the need of time- 
consuming work at and near the hardware level [9, 10, 
11]. By design. GPUs are optimized for manipulating a 
large number of graphics primitives in parallel, which of- 
ten amounts to simple, floating-point matrix calculations. 
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In contrast to current CPUs, they are not designed to cope 
with "unexpected" branches in the code, or for execut- 
ing a single-threaded program as fast as possible. While 
this makes GPUs not well suited as drop-in replacements 
for CPUs for interactive computing, their highly parallel 
architecture might well be taken advantage of in scien- 
tific calculations with an often high degree of vectorizable 
or parallelizable code. Their original design for graph- 
ics calculations, however, entails certain design features 
which are not necessarily optimal for scientific computa- 
tional tasks, such as a special hierarchy of memory orga- 
nization or a restriction to (efficient) floating-point calcu- 
lations only in single precision arithmetics, which only has 
been alleviated in the very latest generation of cards. 

While the flrst applications of general purpose comput- 
ing on GPUs were performed directly in graphics program- 
ming languages such as OpenGL [9] , access to these devices 
for scientiflc applications has been considerably simplifled 
with the advent of language extensions such as NVIDIA 
CUDA [12] and OpenCL [13] for performing general pur- 
pose computing on GPUs. The application presented here 
was coded on the NVIDIA architecture using the CUDA 
framework, which is a high-level extension to the C lan- 
guage family. 

2. Relevant features of GPU architecture 

Figure 1 shows a schematic representation of the NVIDIA 
GPUs used in the work presented here. A GPU consists of 
a number of multiprocessors, each composed of a number 
of single processing imits which concurrently work with the 
same code on different parts of a common data set. Of ut- 
most importance to the efficient performance of GPU pro- 
grams is the organization of GPU memory, which comes 
in a number of flavors: 
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Figure 1: Diagrammatic representation of the hardware layout of 
recent NVIDIA CPUs. 

• Registers: each multiprocessor is equipped with sev- 
eral thousand registers, access to which is local to 
each processing unit and extremely fast. 

• Shared memory: the processors combined in a multi- 
processor have access to a small amount (16 KB for 
Tesla cards and 48 KB for the Fermi architecture) of 
shared memory, which serves as a means of synchro- 
nization and communication between the threads in 
a block. This memory resides on-chip and can be 
accessed essentially without significant memory la- 
tency. 

• Global memory: this large amount of memory (cur- 
rently up to 4 GB) is on separate DRAM chips and 
can be accessed by each thread on each multiproces- 
sor. Access suffers from a latency of several hundred 
clock cycles. 

• Constant and texture memory: these memory areas 
are of the same speed as global memory, but they are 
cached such that read access can be very fast. From 
device perspective they are essentially read-only. 

• Host memory: the memory of the host CPU unit 
cannot be accessed from inside GPU calculations. 
Memory transfers between global and device memory 
are important for communication with the "outside 
world" . 

Additionally, the recent Fermi architecture provides cer- 
tain cache memories, but since the previous Tesla archi- 
tecture is used in the present work, I do not discuss them 
here. In the CUDA framework, calculations are organized 
to match the layout of the hardware: each multiprocessor 
executes (part of) a block of threads concurrently, while 
the different blocks of a grid are assigned to separate mul- 
tiprocessors. To alleviate the large latency (in terms of 
clock cycles) of global memory accesses, in an ideal setup 
there are many more threads in total than available pro- 
cessors, such that a different (part of a) thread block can 



be scheduled for execution while the threads of a given 
block wait for memory fetches or writes. 

For maximum performance, implementations of scien- 
tific calculations have to take these characteristics into ac- 
count and, in particular, should ideally meet the following 
design goals: 

1. a large degree of locality of the calculations, reducing 
the need for communication between threads 

2. a large coherence of calculations with a minimum 
occurrence of divergence of the execution paths of 
different threads 

3. a total number of threads significantly exceeding the 
number of available processing units 

4. a large overhead of arithmetic operations and shared 
memory accesses over global memory accesses 

3. Double checkerboard Metropolis simulations 

As a typical application in statistical physics, I studied 
the single-spin flip Metropolis [14] simulation of a nearest- 
neighbor, ferromagnetic Ising model with Hamiltonian 

'H = -J^^SiSj, Si = ±1 (1) 

on square and simple cubic lattices of edge length L, using 
periodic boundary conditions. A proposed flip of spin Si 
is accepted with the Metropolis probablity 

P.cc{s^ ^ -s.) = min [1, e-^^^] , (2) 

such that the updating decision can be drawn solely upon 
examining the states of spin and its four (in 2D) resp. 
six (in 3D) neighbors. Hence, the necessary calculations 
can be made local and highly parallel by using lattice de- 
compositions of the checkerboard type. The authors of 
Rcf. [11] used a single checkerboard decomposition, work- 
ing on strips or columns of the lattice. Since this setup 
does not take the hierarchical memory organization into 
account, the spin fleld needs to reside in global memory at 
all times, such that memory accesses are very costly. Here, 
instead, I suggest to use a double checkerboard decomposi- 
tion, whose organization is in line with the hierarchic lay- 
out of GPU memory: for the square-lattice system, on a 
flrst, "coarse" level, the lattice is divided into BxB blocks. 
On a second, "flue" level, each block is decomposed, again 
in a checkerboard fashion, into T x T sub-blocks. This is 
illustrated in Fig. 2. As a consequence of this decomposi- 
tion, each large tile of one of the two sub-lattices ("even" 
and "odd" ) of the coarse decomposition can be updated in- 
dependently, and for each tile under consideration all sites 
of one sub-lattice are again independent of each other. It is 
thus possible to load the conflguration of spins of one of the 
coarse tiles into shared memory, including an extra surface 
layer of neighboring spins needed for calculating the local 
energy of spins in the considered tile, cf. the shaded area 
in Fig. 2. This loading operation is distributed over the 
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threads of a block, arranging memory accesses to achieve 
coalescence [12]. 

In total, the simulation thus proceeds as follows: 

1. A kernel is launched assigning all i?^/2 even tiles of 
the coarse checkerboard to a separate thread block, 
all of which are (depending on the number of multi- 
processors available in hardware) executed in paral- 
lel. 

2. The T-^/2 threads of each thread block cooperatively 
load the spin configuration of their tile plus a bound- 
ary layer into shared memory. 

3. The threads of each block perform a Metropolis up- 
date of each even lattice site in their tile in parallel. 

4. The threads of each block are synchronized, ensuring 
that all of them have completed the previous step. 

5. The threads of each block perform a Metropolis up- 
date of each odd lattice site in their tile in parallel. 

6. The threads of each block are again synchronized. 

7. A second kernel is launched working on the 5^/2 odd 
tiles of the coarse checkerboard in the same fashion 
as for the even tiles. 

In practice, the kernels for even and odd sub-lattices 
can be implemented as calls to the same kernel, using 
an extra offset parameter to distinguish sub-lattices. To 
leverage the effect of loading a tile's spin configuration 
into shared memory, a generalized multi-hit technique [15] 
is employed for performing the simulations, where steps 
3-6 above are repeated k times. In this way, one sub- 
lattice of the coarse checkerboard is updated several times 
before updating the other sub-lattice. Close to critical- 
ity, the generalized multi-hit approach leads to somewhat 
increased autocorrelation times [16], which reduces the 
overall efficiency of the implementation presented here in 
the vicinity of a critical point. In view of the existence 
of efficient cluster algorithms for this case [17], however, 
single-spin flip Metropolis simulations arc not the algo- 
rithm of choice for this situation, anyway. The code for 
the Metropolis kernel formulated here is extremely sim- 
ple, taking up only around 60 lines (vs. around 300 lines 
in the implementation presented in Ref. [11]). It can be 
downloaded from the author's website [18]. 

4. Results for the Ising model 

To actually perform the Metropolis updates, a stream 
of pseudo-random numbers is required. It is clear that, 
for reasonable efficiency, each thread needs to have access 
to an independent (sub-)stream of random numbers. For 
simplicity and the sake of comparison, I here use an array 
of simple 32-bit linear congruential generators (LCG) with 
identical multipliers, but randomly chosen initial seeds for 
each thread [11]. It is clear that in view of the short pe- 
riod p = iP' ~ 10^ of the generators, most of the different 
sequences will have significant overlap and, e.g., in a simu- 
lation with 10^ Monte Carlo sweeps of a 1024 x 1024 system 
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Figure 2: Double checkerboard decomposition of a 32'^ square lattice 
for parallel Metropolis simulations on GPU. Each of the B X B = 
4x4 big tiles is assigned as a thread block to a multiprocessor, 
whose individual processors work on one of the two sub-lattices of 
all T X T = 8 X 8 sites of the tile in parallel. 

about lO^'^ random numbers are used, significantly exceed- 
ing the period of the generator, and even more dramati- 
cally exceeding the value considered to be safe when 
using LCGs [19]. Somewhat surprisingly, for the 2D model 
all simulation data are consistent with the exact results for 
the internal energy and specific heat [20] with this setup. 
On the contrary, when using an actually cleaner setup with 
disjoint sub-sequences of the 32-bit LCG, and even when 
using disjoint sequences of an analogous 64-bit LCG with 
period p — 2^"* « 10^^, highly significant deviations are 
encountered. For high-precision real-world applications, 
therefore, I suggest to use different pseudo-random num- 
ber generators, for instance of the Lagged Fibonacci type 
[21]. The corresponding implementations will be discussed 
elsewhere [16] . 

For the 2D model, in Fig. 3 the times for performing 
a single spin flip are presented as a function of the linear 
system size L. The time required for the measurement of 
elementary quantities such as the energy and magnetiza- 
tion is not included in these figures, since pure spin-flip 
times over the years have developed into a standard unit 
for comparing different architectures and implementations 
and thus allow to compare to a host of previous calcula- 
tions. GPU calculations have been performed here on a 
Tesla C1060 device with 4 GB of RAM. By experimenta- 
tion, for the considered system sizes 16 < L < 1024, the 
optimal tile sizes are found to be T = 4 for L < 64, T = 8 
for L = 128 and T = 16 for 128 < L < 1024. Using shared 
memory and the multi-hit technique, single spin flip times 
down to about 0.1 ns can be achieved, significantly exceed- 
ing the performance reported in Ref. [11]. When compar- 
ing these results to CPU calculations, the question arises 
whether multiple CPU cores should be taken into account 
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Figure 3: Computer times for a single spin flip of a Metropolis update 
simulation of the 2D Ising model on a square lattice of edge length L 
using the double checkerboard decomposition and fc-fold generalized 
multi-hit updates. GPU times are for a Tesla C1060 device and CPU 
times for 3.0 GHz Intel Core 2 Quad processors with 4 MB and 6 
MB of cache, respectively. 

[22]. I refrain her from doing so, and use serial CPU code 
as the de facto standard of code used in most simulations 
on single CPUs. The CPU code used in Ref. [11] was a 
one-to-one copy of the GPU code. Just replacing it by 
code more suitable for serial execution already results in a 
speed-up by a factor of two. This observation, as well as 
the cache effect clearly visible in Fig. 3 as the size of the 
spin field of bytes reaches the size of the cache, indicate 
that speed-up factors are a rather fragile measure of GPU 
vs. CPU performance. Trying a relatively fair compari- 
son, using the somewhat optimized code on a CPU with 
sufficiently large cache, results in the speed-up factors pre- 
sented in Fig. 4. Whereas compared to the CPU code used 
in Ref. [11] speed-ups of up to 400 are observed, for the 
more realistic comparison used here, a maximal speed-up 
of around 100 is reached (vs. a speed-up of around 20 for 
the GPU code of Ref. [11]). The double checkerboard de- 
composition proceeds in a completely analogous way for 
the case of the 3D Ising model, and in this case we achieve 
a maximum performance of around 0.24 ns per single spin 
flip with maximal speed-ups of almost 300 compared to the 
corresponding CPU code for a 256'^ system (for this lat- 
tice size, the spin configuation is significantly larger than 
the cache memory if using regular integer variables for the 
spins) . 

It is obvious that the chosen problem and implemen- 
tation come rather close to meeting the design goals set 
out in Sec. 2 and thus constitute a quite ideal applica- 
tion. Indeed, we achieve a total throughput in excess of 
100 GFLOP/s from the chosen implementation which is 
at least of the same order of magnitude as the theoretical 
peak performance of 933 GFLOP/s for the Tesla C1060 
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Figure 4: Speed-up factors of the double checkerboard GPU imple- 
mentation for the 2D Ising model vs. the CPU code as a function of 
linear system size L. 

card. The outlined approach easily generalizes to simu- 
lations of more general spin models, in particular models 
with continuous spins such as the Heisenberg model, where 
the large efhcieny of GPU devices with (single-precision) 
floating-point calculations comes into play. For the case of 
disordered models, parallelism is also possible by working 
on many disorder realizations concurrently. Combining 
such approaches with (asynchronous) multi-spin coding, 
we achieve a performance of around 0.15 ps per single spin 
flip for the Ed wards- Anderson Ising spin glass. These and 
further extensions will be discussed in a separate publica- 
tion [16]. 
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